This report explores a dataset of 4,898 white wines. It contains 11 variables quantifying the chemical properties of each wine, plus a quality score between 0 (very bad) and 10 (excellent).
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality
## 3: 20
## 4: 163
## 5:1457
## 6:2198
## 7: 880
## 8: 175
## 9: 5
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Most wines have a quality score between 5 and 7; scores of 3 or 9 are rare. No wine scores below 3 or above 9.
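As a quick check of how concentrated the scores are, the share of wines scoring 5 to 7 can be worked out from the counts printed in the table. This is only a minimal sketch: the names `quality_counts`, `quality_share` and `share_5_to_7` are introduced here for illustration and are not part of the original analysis.

```r
# Counts per quality score, copied from the table output above
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Share of each score, and of the 5-to-7 band taken together
quality_share <- quality_counts / sum(quality_counts)
share_5_to_7 <- sum(quality_share[c("5", "6", "7")])

round(quality_share, 3)
round(share_5_to_7, 3)
```

The three middle scores cover 4,535 of the 4,898 wines, which is why the extreme scores look so rare in the bar chart.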
Fixed acidity (mainly tartaric acid) is approximately normally distributed, with most values between 6 and 7.5. According to the reference, tartaric acid helps maintain the chemical stability and color of the wine and affects the taste of the finished product. Because tartaric acid is strongly acidic, a high concentration makes the wine taste rough.
## [1] 66
Volatile acidity is approximately normally distributed with a slight right skew. There are 66 wines with an excessive level of acetic acid, more than 0.6 g/L, which can lead to an unpleasant, vinegar-like taste. This negative effect could help distinguish poor-quality wines.
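A count like the 66 above comes from comparing a column against a threshold and summing the result. The sketch below uses a handful of made-up values in place of the real `volatile.acidity` column, so the number it produces has nothing to do with the dataset.

```r
# Toy values standing in for wine$volatile.acidity (g/dm^3); NOT the real data
volatile_acidity <- c(0.27, 0.30, 0.28, 0.23, 0.65, 0.91, 0.22, 0.61)

# Sensory threshold quoted in the text
threshold <- 0.6

# The comparison yields TRUE/FALSE per wine; sum() counts the TRUE values
n_high <- sum(volatile_acidity > threshold)
n_high
```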
According to the International Organisation of Vine and Wine (OIV), citric acid content must not exceed 1 g/L. However, there is an odd peak at 0.49 rather than near 1, which I can’t explain yet. I wonder whether it has something to do with the quality of the wine.
After a log10 transformation of the x-axis, a bimodal distribution appears, with two peaks around 1.6 and 10 and a trough around 3.3. I suspect this reflects different styles of wine that vary in residual sugar, such as dry and sweet wines.
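To illustrate why a log10 scale can reveal two modes that a linear scale hides, here is a small sketch on synthetic data. The mixture of two lognormal groups centred near 1.6 and 10 is an assumption chosen to mimic the description above; it is not the real `residual.sugar` column, and base R `hist()` is used instead of whatever plotting code produced the original figure.

```r
set.seed(1)
# Synthetic, right-skewed stand-in for wine$residual.sugar:
# two lognormal groups centred near 1.6 and 10 (assumed values)
residual_sugar <- c(rlnorm(500, meanlog = log(1.6), sdlog = 0.3),
                    rlnorm(500, meanlog = log(10),  sdlog = 0.3))

# Histogram on the log10 scale; plot = FALSE returns the bin counts only
h <- hist(log10(residual_sugar), breaks = 30, plot = FALSE)

# Bin midpoints converted back to the original units, with their counts
data.frame(sugar = round(10^h$mids, 2), count = h$counts)
```

On the log scale both groups get bins of comparable width, so the dip between them becomes visible.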
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Most wines have a chloride content below 0.1, and the third quartile is 0.05.
## [1] 868
SO2 is mostly undetectable in wine, but at free SO2 concentrations above 50 ppm it becomes evident in the nose and taste, which could hurt the quality of the wine. Using mg/L as a rough approximation of ppm, 868 wines exceed this limit.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The distribution of total.sulfur.dioxide has more variance but fewer outliers than that of free.sulfur.dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Most wines have a density between 0.99 and 1.00. I suspect density is related to alcohol and residual sugar content.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Of all the variables in this section, pH is the closest to a normal distribution. pH should be influenced by fixed.acidity.
Sulphates are a wine additive that can contribute to sulfur dioxide gas (SO2) levels.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
According to the reference, alcohol has a twofold effect on wine taste. On the one hand, the mellowness of the wine becomes evident only when the alcohol content is above 11% (v/v), and an alcohol content below 10% (v/v) makes the wine taste flat rather than fat. On the other hand, an alcohol content above 14% becomes noticeable and brings unpleasant sensations such as a strong burning feeling and bitterness.
There are 4898 wine samples in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). The variable quality is an ordered factor with the following possible levels, of which only 3 to 9 occur in the data.
(worst) -----------> (best)
quality: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Other observations:
* Most wines have a quality score between 5 and 7.
* There is a notable peak for citric acid at 0.49.
* 66 wines contain an excessive level of acetic acid, more than 0.6 g/L.
* 868 wines have free SO2 concentrations over 50 ppm.
* The median alcohol content is 10.4 and the maximum is 14.20.
* Most wines have chlorides below 0.1 g/dm^3.
* Most wines have residual sugar below 20 g/dm^3.
The main features are quality and alcohol (a guess based on the references). I’d like to train a model to classify the quality of a wine, and alcohol should play an important role.
Volatile.acidity, free.sulfur.dioxide, residual.sugar and citric.acid likely contribute to the quality of a wine, but for now I can’t tell which contributes more.
There is a citric.acid peak at 0.49 g/dm^3 in the distribution, which I find puzzling.
I log-transformed the right-skewed residual.sugar distribution. The transformed distribution appears bimodal, with two peaks around 1.6 and 10 and a trough around 3.3.
Looking at the left subplots, we can see different distributions across the quality groups, especially for alcohol, volatile.acidity, free.sulfur.dioxide, total.sulfur.dioxide and residual.sugar.
The median alcohol content decreases from the score-3 group to the score-5 group, then rises quickly up to the last group, which has the highest median, 12.5. It seems that high-quality wines tend to have a higher alcohol content.
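The per-group medians read off a boxplot can also be computed directly. The sketch below is only an illustration on a toy data frame (`toy_wine` and its values are made up, though shaped to follow the dip-then-rise pattern described above); with the real data the same `tapply()` call would be applied to the `alcohol` and `quality` columns.

```r
# Toy stand-in for the wine data frame; values are illustrative only
toy_wine <- data.frame(
  quality = factor(c(3, 3, 5, 5, 5, 7, 7, 9, 9)),
  alcohol = c(10.5, 10.3, 9.4, 9.6, 9.5, 11.2, 11.6, 12.4, 12.6)
)

# Median alcohol content within each quality level
median_alcohol <- tapply(toy_wine$alcohol, toy_wine$quality, median)
median_alcohol
```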
##
## 3 4 5 6 7 8 9
## 19 142 1433 2182 878 173 5
##
## 3 4 5 6 7 8 9
## 1 21 24 16 2 2 0
In the first plot, volatile.acidity rises from the score-3 group to the score-4 group, then decreases slowly. The second group (score 4) has the highest volatile acidity and the largest variance.
I split the dataset into two parts depending on whether the volatile acidity of a wine exceeds 0.6 g/dm^3. Comparing the two parts, wines with high volatile acidity rarely reach a score of 7 or more; most score between 4 and 6.
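The comparison can be made concrete by computing, for each part, the share of wines that score 7 or higher. This sketch copies the counts from the two volatile acidity tables; I am assuming the first table (4,832 wines) is the group at or below 0.6 g/dm^3 and the second (66 wines) is the group above it, since 66 matches the count reported earlier. The variable names are illustrative.

```r
scores <- as.character(3:9)

# Counts per quality score, copied from the two tables above
low_va  <- setNames(c(19, 142, 1433, 2182, 878, 173, 5), scores)  # <= 0.6 g/dm^3
high_va <- setNames(c(1, 21, 24, 16, 2, 2, 0), scores)            # >  0.6 g/dm^3

# "Good" here means a score of 7 or higher
good <- c("7", "8", "9")
share_good_low  <- sum(low_va[good])  / sum(low_va)
share_good_high <- sum(high_va[good]) / sum(high_va)

round(c(low = share_good_low, high = share_good_high), 3)
```

Only 4 of the 66 high-acidity wines reach a score of 7 or more, against 1,056 of the 4,832 others.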
##
## 3 4 5 6 7 8 9
## 15 149 1108 1808 795 151 4
##
## 3 4 5 6 7 8 9
## 5 14 349 390 85 24 1
The medians of the groups are quite close, except for wines scoring 4 or 9. The second group has the lowest median, followed by the last group.
I split the dataset into two parts depending on whether the free SO2 content of a wine exceeds 50 mg/L. Most wines with free.sulfur.dioxide above 50 mg/L are of medium quality, scoring 5 or 6.
Since total sulphur dioxide consists of the free and bound forms of sulphur dioxide, I create a new feature named bound.sulfur.dioxide by subtracting free.sulfur.dioxide from total.sulfur.dioxide, and draw a boxplot of bound.sulfur.dioxide.
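The derived feature is a plain column subtraction. As a small sketch, the first four values of the two columns (as printed in the structure output at the top of the report) are enough to show the step; the data frame name `so2` is introduced here only for illustration.

```r
# First four rows of the two columns, as printed in the structure output
so2 <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# Bound SO2 is what remains of the total after removing the free form
so2$bound.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide
so2
```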
The pattern of the free SO2 distribution is very similar to that of total SO2. Across the first three quality groups, the medians decrease and then increase. In the following groups, the medians of free SO2 and total SO2 both decline, but the latter falls further, which could be explained by the bound.sulfur.dioxide distribution.
So I’d like to use free.sulfur.dioxide and bound.sulfur.dioxide in place of total.sulfur.dioxide when building my classification model later.
In the earlier analysis, the residual.sugar distribution turned out to be bimodal after the log transformation. To see whether this has anything to do with quality, I plot histograms of residual.sugar faceted by quality. All groups appear bimodal except the top and bottom groups, which have few samples. Bimodality is therefore more likely a common phenomenon that has little to do with wine quality; differences in sugar content between wine styles may explain it.
The residual.sugar distribution across quality is similar to that of free SO2: it declines, rises, then declines again, and the score-4 group has the lowest median.
The density distribution across quality is quite similar to that of residual.sugar. Density is also strongly related to alcohol, total.sulfur.dioxide and fixed.acidity. I’m going to build a linear model to predict and replace density in the multivariate analysis section.
I plot the citric.acid distribution faceted by quality, and all groups have a peak at 0.49 except the top and bottom groups. This unusual peak cannot be explained by quality differences.
The medians of the groups are very close; the score-4 and score-9 groups have the minimum and maximum median respectively.
Though the medians of the groups are quite close, those of the first and last groups are slightly higher.
As quality rises across groups, the median pH increases, except for the first group. Wines scoring 9 have the highest median.
Though the differences between the medians of the quality groups are small, wines scoring 9 are likely to have fewer chlorides. Wines of middle quality have much more variance in chlorides.
The medians of the groups are quite close, but wines of fair quality have more variance in sulphates.
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
The density of wine is negatively correlated with alcohol content; the correlation coefficient is -0.78.
##
## Pearson's product-moment correlation
##
## data: wine$residual.sugar and wine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
The density of wine is positively correlated with residual sugar content; the correlation coefficient is 0.84.
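The t statistics in the two test outputs follow from the correlation coefficients and the degrees of freedom through the standard relation t = r * sqrt(df) / sqrt(1 - r^2). The sketch below recomputes them from the printed values as a consistency check; the helper `t_from_r` is a name I introduce here, not something from the original analysis, and small differences from the printed output are expected because the printed r is rounded.

```r
# t statistic for a Pearson correlation r with dof = n - 2 degrees of freedom
t_from_r <- function(r, dof) r * sqrt(dof) / sqrt(1 - r^2)

dof <- 4896                              # reported in both test outputs
t_alcohol <- t_from_r(-0.7801376, dof)   # alcohol vs density
t_sugar   <- t_from_r(0.8389665, dof)    # residual.sugar vs density

round(c(alcohol = t_alcohol, residual.sugar = t_sugar), 2)
```

With almost 4,900 observations, even much weaker correlations would produce very small p-values, so the size of r is more informative here than the significance test.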
This matrix plot shows other features (total.sulfur.dioxide, free.sulfur.dioxide, bound.sulfur.dioxide, fixed.acidity) related to density. Total SO2 and bound SO2 both have a moderately positive relationship with density. Free SO2 and fixed acidity both have a slightly positive relationship with density.
Quality is most clearly related to alcohol, volatile.acidity and free.sulfur.dioxide.
Among wines of fair or high quality, the better the wine, the more alcohol it has. Among low-quality wines, however, wines scoring 4 have a lower median than wines scoring 3.
Only four wines with more than 0.6 g/dm^3 volatile acidity score 7 or higher. Wines with high volatile acidity can therefore hardly get a good score.
Most wines with free.sulfur.dioxide above 50 mg/L are of medium quality, scoring 5 or 6.
The density of wine is strongly correlated with alcohol and residual.sugar: the more alcohol a wine has, the lower its density, and the more residual sugar it has, the higher its density. In addition, total SO2, bound SO2, free SO2 and fixed acidity have a weaker positive relationship with density than residual sugar does.
The plots above illustrate the earlier finding: more residual sugar and less alcohol mean higher density, and vice versa.
Next, I build a linear model to predict the density of wine.
##
## Calls:
## m1: lm(formula = density ~ alcohol + residual.sugar + total.sulfur.dioxide,
## data = wine)
## m2: lm(formula = density ~ alcohol + residual.sugar + free.sulfur.dioxide +
## bound.sulfur.dioxide, data = wine)
## m3: lm(formula = density ~ alcohol + residual.sugar + bound.sulfur.dioxide,
## data = wine)
## m4: lm(formula = density ~ alcohol + residual.sugar + bound.sulfur.dioxide +
## fixed.acidity + free.sulfur.dioxide, data = wine)
## m5: lm(formula = density ~ alcohol + residual.sugar + bound.sulfur.dioxide +
## fixed.acidity, data = wine)
##
## ====================================================================================
## m1 m2 m3 m4 m5
## ------------------------------------------------------------------------------------
## (Intercept) 1.003*** 1.003*** 1.003*** 0.999*** 0.999***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## alcohol -0.001*** -0.001*** -0.001*** -0.001*** -0.001***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## residual.sugar 0.000*** 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## total.sulfur.dioxide 0.000***
## (0.000)
## free.sulfur.dioxide -0.000*** -0.000***
## (0.000) (0.000)
## bound.sulfur.dioxide 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000)
## fixed.acidity 0.001*** 0.001***
## (0.000) (0.000)
## ------------------------------------------------------------------------------------
## R-squared 0.911 0.915 0.914 0.935 0.935
## adj. R-squared 0.911 0.915 0.914 0.935 0.935
## sigma 0.001 0.001 0.001 0.001 0.001
## F 16738.603 13217.503 17440.713 14157.987 17649.239
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 27448.397 27564.052 27540.255 28226.250 28219.529
## Deviance 0.004 0.004 0.004 0.003 0.003
## AIC -54886.794 -55116.104 -55070.510 -56438.500 -56427.058
## BIC -54854.311 -55077.124 -55038.027 -56393.023 -56388.079
## N 4898 4898 4898 4898 4898
## ====================================================================================
The fifth linear model accounts for 93.5% of the variance in the density of wine, so I’d like to use the combination of alcohol, residual.sugar, bound.sulfur.dioxide and fixed.acidity in place of density when building the model to predict the quality of wine.
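The kind of comparison summarised in the model table can be sketched with `lm()`, R-squared and `AIC()`. The example below runs on synthetic data: the data frame `sim`, its value ranges and the coefficients used to generate `density` are all assumptions made up for illustration, so the numbers it prints are not the fitted values from the table.

```r
set.seed(42)
n <- 200

# Synthetic predictors; ranges loosely follow the summary table of the report
sim <- data.frame(
  alcohol        = runif(n, 8, 14),
  residual.sugar = runif(n, 0.6, 20),
  fixed.acidity  = runif(n, 4, 10)
)

# Synthetic response with made-up coefficients, NOT the fitted ones above
sim$density <- 1 - 0.001 * sim$alcohol + 0.0004 * sim$residual.sugar +
  0.001 * sim$fixed.acidity + rnorm(n, sd = 0.0005)

# Nested models, similar in spirit to adding fixed.acidity to the formula
fit_small <- lm(density ~ alcohol + residual.sugar, data = sim)
fit_large <- lm(density ~ alcohol + residual.sugar + fixed.acidity, data = sim)

# Higher R-squared and lower AIC favour the larger model on this toy data
c(r2_small = summary(fit_small)$r.squared,
  r2_large = summary(fit_large)$r.squared)
AIC(fit_small, fit_large)
```

Because R-squared never decreases when a predictor is added, AIC or BIC is the more useful criterion for deciding whether an extra term earns its place.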
## Source: local data frame [4 x 3]
## Groups: high.volatile.acidity [?]
##
## high.volatile.acidity high.free.sulfur.dioxide n
## <chr> <chr> <int>
## 1 volatile.acidity<=0.6 free.sulfur.dioxide<=50 3970
## 2 volatile.acidity<=0.6 free.sulfur.dioxide>50 862
## 3 volatile.acidity>0.6 free.sulfur.dioxide<=50 60
## 4 volatile.acidity>0.6 free.sulfur.dioxide>50 6
Wines are divided into four groups:
| group | size | features |
|---|---|---|
| low free SO2, low volatile acidity | 3970 | The quality distribution is similar to that of the whole dataset. |
| high free SO2, low volatile acidity | 862 | Most wines score 5 or 6; only one scores 9. |
| low free SO2, high volatile acidity | 60 | Most wines score between 4 and 6; none scores 9. |
| high free SO2, high volatile acidity | 6 | Wines score only between 4 and 6; none scores 3 or reaches 7. |
In short, wines with high free SO2 or high volatile acidity are much less likely to be of high quality.
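The four group sizes come from labelling each wine against the two thresholds and counting the combinations. The printed output above looks like a grouped count; the sketch below gets the same kind of result in base R with `table()`, on a toy data frame whose values are made up for illustration.

```r
# Toy stand-in for the wine data; values are illustrative only
toy_wine <- data.frame(
  volatile.acidity    = c(0.27, 0.65, 0.30, 0.70, 0.22, 0.28),
  free.sulfur.dioxide = c(45, 14, 60, 55, 30, 52)
)

# Label each wine by whether it exceeds the two thresholds used in the text
va_group  <- ifelse(toy_wine$volatile.acidity > 0.6,
                    "volatile.acidity>0.6", "volatile.acidity<=0.6")
so2_group <- ifelse(toy_wine$free.sulfur.dioxide > 50,
                    "free.sulfur.dioxide>50", "free.sulfur.dioxide<=50")

# Cross-tabulate the two labels to get the four group sizes
group_sizes <- table(va_group, so2_group)
group_sizes
```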
Next, I build the classification model.
##
## Call:
## randomForest(formula = quality ~ ., data = new_wine, mtry = 10, ntree = 170)
## Type of random forest: classification
## Number of trees: 170
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 28.26%
## Confusion matrix:
## 3 4 5 6 7 8 9 class.error
## 3 40 0 0 0 0 0 0 0.0000000
## 4 0 49 74 40 0 0 0 0.6993865
## 5 0 13 1048 382 14 0 0 0.2807138
## 6 0 2 278 1771 143 4 0 0.1942675
## 7 0 2 17 321 528 12 0 0.4000000
## 8 0 0 2 46 44 83 0 0.5257143
## 9 0 0 0 0 0 0 20 0.0000000
## MeanDecreaseGini
## fixed.acidity 280.5509
## volatile.acidity 366.2120
## citric.acid 281.9894
## residual.sugar 324.5029
## chlorides 296.9875
## free.sulfur.dioxide 363.8392
## pH 315.3327
## sulphates 289.7706
## alcohol 484.9548
## bound.sulfur.dioxide 347.6010
## predicted
## actual 3 4 5 6 7 8 9
## 3 11 0 0 0 0 0 0
## 4 0 49 0 0 0 0 0
## 5 0 0 428 0 0 0 0
## 6 0 0 0 621 0 0 0
## 7 0 0 0 0 271 0 0
## 8 0 0 0 0 0 55 0
## 9 0 0 0 0 0 0 3
I build a RandomForest model to classify the quality of wine. To address the class imbalance in the data set, I add the wines scoring 3 one more time and the wines scoring 9 three more times. Although all the predictions on my test data match the actual values, the model’s out-of-bag (OOB) error rate is 28.26%. Wines scoring 4, 7 and 8 are the three most difficult groups to predict. The most important feature is alcohol, followed by volatile.acidity and free.sulfur.dioxide.
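The oversampling step can be sketched by appending copies of the rare rows. To keep the example self-contained, it rebuilds only the `quality` column from the counts reported earlier; `wine_q`, `extra_3` and `extra_9` are names introduced here for illustration, and the real step would copy whole rows of the wine data.

```r
# Rebuild only the quality column from the counts reported earlier
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
wine_q <- data.frame(
  quality = factor(rep(names(quality_counts), quality_counts))
)

# Duplicate the rare classes: score 3 once more, score 9 three more times
extra_3 <- wine_q[wine_q$quality == "3", , drop = FALSE]
extra_9 <- wine_q[wine_q$quality == "9", , drop = FALSE]
oversampled <- rbind(wine_q, extra_3, extra_9, extra_9, extra_9)

table(oversampled$quality)
```

This gives 40 wines scoring 3 and 20 scoring 9, matching the row totals of the confusion matrix. One caveat worth keeping in mind: exact duplicates can end up both inside and outside a tree's bootstrap sample, which may make the error for the duplicated classes look better than it really is.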
The third picture shows the impact of high free SO2 and volatile acidity on the quality of wine. Wines with high free SO2 or high volatile acidity are much less likely to be of high quality. Wines with both high free SO2 and high volatile acidity score only between 4 and 6, never reaching 7.
The first plot shows the relationships among residual sugar, alcohol and density: more residual sugar and less alcohol mean higher density, and vice versa. The next matrix plot shows the relationships between density and other features. Total.sulfur.dioxide and bound.sulfur.dioxide both have a moderately positive relationship with density. Free.sulfur.dioxide and fixed.acidity both have a slightly positive relationship with density.
Yes, I built two models. The first is a linear model that predicts the density of wine and accounts for 93.5% of the variance in density.
The second is a RandomForest model that classifies the quality of wine. All my test data are predicted correctly, but the model’s out-of-bag error rate is 28.26%. Wines scoring 4, 7 or 8 are difficult to predict. Another limitation is that the model can predict neither a score of 10 nor a score below 3, because the data contain no such samples.
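The 28.26% figure and the per-class errors can be recomputed from the confusion matrix printed in the model output. The sketch below copies that matrix by hand; the names `conf`, `overall_error` and `class_error` are mine, not from the original code.

```r
# Confusion matrix copied from the randomForest output (rows = actual class)
conf <- matrix(c(40,  0,    0,    0,   0,  0,  0,
                  0, 49,   74,   40,   0,  0,  0,
                  0, 13, 1048,  382,  14,  0,  0,
                  0,  2,  278, 1771, 143,  4,  0,
                  0,  2,   17,  321, 528, 12,  0,
                  0,  0,    2,   46,  44, 83,  0,
                  0,  0,    0,    0,   0,  0, 20),
               nrow = 7, byrow = TRUE,
               dimnames = list(actual = as.character(3:9),
                               predicted = as.character(3:9)))

# Overall error: share of wines that fall off the diagonal
overall_error <- 1 - sum(diag(conf)) / sum(conf)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

round(overall_error, 4)
round(class_error, 3)
```

Most of the misclassifications sit in the cells next to the diagonal, meaning the model usually misses by one quality level rather than by several.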
The residual sugar distribution appears bimodal on a log scale, both overall and when faceted by quality. This is perhaps because residual sugar content falls into two different ranges depending on the style of wine, such as dry and sweet wines. There are two peaks around 1.6 and 10, and a trough around 3.
According to the International Organisation of Vine and Wine (OIV), citric acid content must not exceed 1 g/L. However, there is an odd peak at 0.49 rather than near 1. After faceting by quality, all groups have a peak at 0.49 except the first and last groups, which means this unusual peak cannot be explained by quality differences. I’m still puzzled by it.
Across the whole distribution, most wines have a citric acid content between 0.2 and 0.5 g/dm^3, and the median of 0.32 is close to the mean of 0.3342. The highest peak is at 0.3 g/dm^3, a value shared by 307 wines.
The plot shows the impact of high free SO2 and volatile acidity on the quality of wine. The more free SO2 and volatile acidity a wine contains, the less likely it is to be of high quality. In the sub-plot in the bottom right corner, wines with both high free SO2 and high volatile acidity score only between 4 and 6, never reaching 7.
The Wine data set contains 4898 wine samples across 12 variables. I started by looking up the meaning of each variable and its influence on wine quality. Then I examined the distribution of each variable and explored quality across many variables. I split total.sulfur.dioxide into free and bound parts. After studying the relationship between density and other features, I built a linear model to replace the density variable. Finally, I built a RandomForest model to classify the quality of wine.
At first, I thought fixed.acidity might be one of the features most closely related to quality. But the medians of fixed acidity in the quality groups were quite close, making fixed acidity less important. I explored the quality of wines across variables and found that the medians of alcohol content differed considerably between groups: the median declined, then increased to its highest point, 12.5. Wines scoring 9 tended to have a higher alcohol content and wines scoring 5 a lower one. When I split the data set into four parts (low free SO2 & low volatile acidity, high free SO2 & low volatile acidity, low free SO2 & high volatile acidity, high free SO2 & high volatile acidity) and plotted the quality distribution, it became obvious that wines with high free SO2 or high volatile acidity are much less likely to be of high quality. As for the RandomForest model I built at the end, I used 10 features (alcohol, volatile.acidity, free.sulfur.dioxide, bound.sulfur.dioxide, residual.sugar, pH, chlorides, sulphates, citric.acid, fixed.acidity) and included all wine samples.
The class imbalance in the data set is serious. The data set contains 4898 wines, but only 20 wines scoring 3, 5 wines scoring 9, and no wines scoring 1, 2 or 10. I added the wines scoring 3 one more time and the wines scoring 9 three more times before training the model. Still, I could not deal with the absent scores, so the model cannot recognize these three quality categories. Besides, the classification model has a 28.26% error rate, mainly due to poor recognition of the medium-quality wines (scores 4 to 7). In further analysis, data on the absent wines should be added, and I should explore more features to distinguish the medium-quality wines in detail.